ABSTRACT


Peer assessment is a promising solution for scaling up the 
grading of a large number of submissions. The reliability of 
evaluations is one of the critical issues in peer assessment; 
several probabilistic models have been proposed for obtain- 
ing reliable grades from peers. Peer correction is a similar 
framework, in which students are instructed to correct the 
errors in submissions from other students. Peer correction 
is typically performed simultaneously with peer assessment; 
a reviewer is instructed to correct the errors in a submission 
and to provide a grade to it. We observe the occasional in- 
consistency between a grade and the correction; for example, 
a reviewer provides a high grade for a submission but she 
corrects many errors in it. Such inconsistencies can point to 
unreliable reviewers. In this paper, we propose probabilistic
models for peer correction and combine them with existing
peer assessment models; the combined models capture this
inconsistency to estimate the reviewer reliability and the
student ability more accurately. We conduct
experiments using the dataset of an actual peer correction 
platform for language translation, and the results demon- 
strate that the combination of peer correction models and 
peer assessment models improves the accuracy of the student 
ability estimation. 


1. INTRODUCTION


MOOCs have changed education by offering open access to 
university course materials; however, not everything per- 
formed in offline classes is effectively introduced in MOOCs. 
An example is the ability assessment; in offline classes, 
teachers evaluate the student abilities by examining their 
submitted assignments and decide how to improve the ed- 
ucational efficiency. In contrast, assessing the abilities of 
tens of thousands of students in MOOCs is not feasible for 
teachers. 


A promising solution for large-scale ability assessment is to
allow students themselves to be involved in the evaluation;
instead of teachers, students grade the submissions from
other students. Such a peer assessment approach is beneficial
for scaling up the ability assessment, and it has been applied
to several MOOC courses [7]. However, the reliability of
evaluations is one of the critical issues in peer assessment 
because some students may provide unreliable evaluations 
owing to laziness or lack of evaluation skills. Several prob- 
abilistic models have been proposed for estimating the re- 
liabilities of the reviewers in order to accurately assess the 
student abilities in peer assessment [7, 4, 11, 8, 13, 6]. These 
models are based on the assumption that students with high 
ability are likely to provide reliable grades. The models are 
used to estimate the ability of a student as a test taker and 
the reliability as a reviewer. 


In a framework similar to peer assessment, called peer
correction, students correct the errors in the submissions
from other students. Peer correction helps teachers reduce
the effort of providing feedback to the students.
Typically, peer correction is performed simultaneously with
peer assessment; a student is instructed to grade a
submission and to correct its errors.


Although the outcomes of peer correction are naturally as- 
sumed to be informative for estimating the student abilities, 
probabilistic models for peer correction have not yet been 
investigated. Based on the natural assumption that a student
who receives fewer corrections is likely to have a higher
ability, we propose probabilistic models for peer correction
that capture the relationship between the student abilities 
and the correction outcomes. 


Additionally, we noticed an inconsistency between the out- 
comes of peer correction and those of peer assessment. In 
one case, a reviewer provides a high grade to a submission 
but she corrects many errors in it; in another case, a reviewer 
assigns a low grade but she does not make any corrections. 
Our idea is that such inconsistencies are beneficial in deter- 
mining unreliable reviewers; thus, we propose to combine 
peer assessment models with our peer correction models. 
This combination allows us to capture the inconsistency and 
to incorporate it into the estimation of the reviewer reliabil- 
ity and the student ability. 


We conduct experiments using a peer correction dataset
on language translation. The results of the experiments
show that our probabilistic models for peer correction are
capable of estimating the student abilities, and that the
combined models of peer correction and peer assessment
perform better in identifying high-ability students
than the peer assessment models alone.


The contributions of this paper are twofold: (i) we propose
novel probabilistic models for peer correction that enable us
to estimate the student abilities from the received
corrections (Section 4), and (ii) we propose to combine our
peer correction models and peer assessment models to exploit
the inconsistencies between the outcomes of corrections and
assessments (Section 6); the results of the experiments show
that the combined models are effective in accurately
estimating the student abilities.


2. PROBLEM DEFINITION 


We begin with the formulation of peer assessment and peer
correction. We assume there is a set of students $S$. When
a student creates a submission for an assignment, other
students (whom we call reviewers) evaluate it and assign
grades. The grade for the student $u \in S$ assigned by the
reviewer $v \in S$ is denoted by $z_{uv} \in \mathbb{R}$. Each
reviewer is additionally instructed to correct the errors in
a submission. A correction result is denoted by $y_{uv}$. If a
reviewer does not provide any correction for a submission,
such information is also embedded in $y_{uv}$. The
representation of $y_{uv}$ is discussed in the next section.


Given a set of peer assessment and peer correction outcomes,
$D$, each of which is represented by a tuple
$(u, v, z_{uv}, y_{uv})$, our goal is to estimate the true
abilities of the students $\{s_u\}_{u \in S}$, where
$s_u \in \mathbb{R}$.


3. DATASET 


In this work, we use a peer assessment and peer correction 
dataset collected from Conyac¹, which is a crowdsourcing
language translation platform. This platform employs peer 
correction and peer assessment between translators for col- 
laboratively improving their skills; thus, a translator on this 
platform can be considered as a student. When a student 
submits a translation, other students evaluate its quality on 
a five-point scale (zero (low) to four (high)) and correct the 
errors in it. Students are invited to high-reward jobs if they 
have reviewed several submissions. 


Students on Conyac can take a qualification test to demon- 
strate their skills. On this test, a student is instructed to 
translate the given sentences and then the translations are 
evaluated by experts employed by the service provider.
According to the score, a student is assigned one of five
expertise levels (D, C, B, A, and A+). This level is used for
the job assignment and the default level is set to one. We
consider the assigned levels as the ground truth of the
student abilities, which we aim to estimate from the outcomes
of peer assessment and peer correction.


We target the peer assessment and peer correction for
Japanese to English translations on Conyac. Our dataset
contains 5,008 reviews for 413 students, and 135 students
provide at least one review. Figure 1(a) shows the
distribution of the grades assigned to translations and
Figure 1(b) illustrates the distribution of the students' true
expertise levels.


¹ https://conyac.cc/


We conduct exploratory data analysis to investigate how 
the outcomes of peer correction can be used for estimating 
student expertise levels. A natural expectation is that a stu- 
dent whose submissions are likely to be corrected would have 
lower ability. We calculate the correction ratio of each stu- 
dent, which is the number of corrected submissions divided 
by the number of submissions. Figure 1(c) shows the aver- 
age correction ratio of the students in each expertise level. 
We observe that students with the highest level are likely to 
have lower correction ratios than the others. 


Additionally, we consider that students who have more
errors in their submissions would have lower ability. We
calculate the number of corrected parts in each submission by
applying Gestalt pattern matching [10]. We first obtain
the matched patterns in pre-correction and post-correction 
submissions, and then count the number of unmatched pat- 
terns in the post-correction submissions. The examples of 
the calculated numbers are shown in Table 1 and Figure 1(d) 
shows the distribution of the number of corrected parts in 
each submission. We calculated the average number of cor- 
rected parts of each student and Figure 1(e) presents the 
average of the values at each level. We found that the stu- 
dents with higher levels are likely to have a lower number of 
corrected parts. 
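
The counting step can be sketched in Python as follows. This is a minimal sketch, not the exact procedure used for the dataset: it assumes whitespace tokenization and counts every non-matching span (modified, added, or removed) as one corrected part. The function name `count_corrected_parts` is hypothetical. Python's standard `difflib.SequenceMatcher` is used because its matching algorithm is based on the Ratcliff/Obershelp Gestalt pattern matching.

```python
from difflib import SequenceMatcher


def count_corrected_parts(pre: str, post: str) -> int:
    """Count the spans that differ between a submission and its correction.

    Assumption: tokens are whitespace-separated words, and each
    'replace', 'insert', or 'delete' span counts as one corrected part.
    """
    pre_tokens = pre.split()
    post_tokens = post.split()
    # autojunk=False disables the popularity heuristic for long sequences.
    matcher = SequenceMatcher(a=pre_tokens, b=post_tokens, autojunk=False)
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


if __name__ == "__main__":
    print(count_corrected_parts("the cat sat", "a cat sat down"))
```

In this example, "the" is modified to "a" and "down" is added, so two spans are counted.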


From these observations, we decide to use the following
binary and numerical variables to represent a correction
outcome: (1) $y^{(b)}_{uv} \in \{0, 1\}$, which indicates
whether the corresponding submission is corrected by the
grader ($y^{(b)}_{uv} = 0$) or not ($y^{(b)}_{uv} = 1$), and
(2) $y^{(n)}_{uv} \in \{0, 1, 2, \ldots\}$, which indicates the
number of parts corrected by the grader.


4. PEER CORRECTION MODELS 


We propose two peer correction models, PC_b and PC_n, for
estimating the true abilities of the students. The models are
illustrated in Figure 2(a).


4.1 PC_b model


We first present a generative model for
$y^{(b)}_{uv} \in \{0, 1\}$, which is a binary indicator of
whether the submission has been corrected by the reviewer or
not. We introduce two latent parameters into our model: the
true ability of the student and the bias of the reviewer. Each
student is associated with a latent true ability,
$s_u \in \mathbb{R}$, which we aim to estimate, and each
reviewer has a bias parameter, $b_v \in \mathbb{R}$, presuming
that a reviewer with a lower bias tends to review a submission
negatively.


Following the observations, we assume that a submission
from a student is unlikely to be corrected by a reviewer
if the student has high ability. In addition, a reviewer is
unlikely to correct a submission if he/she has a higher
bias. These assumptions are represented as the following
generative model:


$y^{(b)}_{uv} \sim \mathrm{Bern}\left(y^{(b)}_{uv} \mid \sigma(s_u + b_v + r)\right)$,  (1)


where $\sigma(x) = 1 / (1 + \exp(-x))$, $\mathrm{Bern}(\cdot)$ is the
Bernoulli distribution, and $r$ is a noise term. Note that
$y^{(b)}_{uv} = 1$ indicates that the corresponding submission
is not corrected by the reviewer. We call this generative
model the PC_b model. We can interpret $s_u + b_v + r$ as the
apparent ability of the student $u$ for the reviewer $v$ at
the time. The model indicates that a submission is unlikely
to be corrected when the apparent ability is high.


Figure 1: Statistics of our dataset


Table 1: Examples of the calculated number of corrected parts. The corrected parts are highlighted (modified parts are
highlighted in yellow, added parts are highlighted in pink, and removed parts are highlighted in blue). There were seven
corrected parts in the last example because there were three modified parts ("Current members," "are" and "was working on
the"), two added parts ("female" and ","), and two removed parts ("girls" and "the").

[Table 1 body (columns: pre-correction submission, post-correction submission, number of corrected parts) is not recoverable from the extracted text.]


In the same way as the existing peer assessment models that
will be reviewed in the next section, we use normal
distributions as priors for $s_u$, $b_v$, and $r$:


(Student ability)  $s_u \sim \mathcal{N}(s_u \mid \mu_0, 1/\gamma_0)$  (2)
(Reviewer bias)    $b_v \sim \mathcal{N}(b_v \mid 0, 1/\eta_0)$  (3)
(Noise)            $r \sim \mathcal{N}(r \mid 0, 1/\kappa_0)$,  (4)


where $\mu_0$, $\gamma_0$, $\eta_0$, and $\kappa_0$ are hyperparameters.
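
The generative process of the binary correction model (Eq. (1) with the priors in Eqs. (2)-(4)) can be illustrated with the following minimal Python sketch. The function names and the default hyperparameter values are hypothetical and chosen only for illustration; the second argument of each normal prior is treated as a variance, so the standard deviation is the inverse square root of the corresponding precision-like hyperparameter.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_not_corrected(s_u: float, b_v: float, r: float = 0.0) -> float:
    """Probability that the binary outcome is 1 (submission NOT corrected)."""
    return sigmoid(s_u + b_v + r)


def sample_pc_b(rng, mu0=0.0, gamma0=1.0, eta0=1.0, kappa0=1.0):
    """Draw (s_u, b_v, r, y) from the priors and the Bernoulli outcome model.

    Hyperparameter defaults are illustrative assumptions only.
    """
    s_u = rng.gauss(mu0, 1.0 / math.sqrt(gamma0))   # student ability
    b_v = rng.gauss(0.0, 1.0 / math.sqrt(eta0))     # reviewer bias
    r = rng.gauss(0.0, 1.0 / math.sqrt(kappa0))     # noise
    y = 1 if rng.random() < prob_not_corrected(s_u, b_v, r) else 0
    return s_u, b_v, r, y


if __name__ == "__main__":
    print(sample_pc_b(random.Random(0)))
```

A higher ability or a higher reviewer bias raises the probability that the submission is left uncorrected, which matches the assumptions stated above.
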


4.2 PC_n model

Our second model targets the number of corrected parts in
each correction, $y^{(n)}_{uv} \in \{0, 1, 2, \ldots\}$.
Following the observations from the actual dataset, we assume
that a reviewer corrects more parts of a submission when the
student has a lower ability. We use the Poisson distribution
to represent this assumption:


$y^{(n)}_{uv} \sim \mathrm{Poisson}\left(y^{(n)}_{uv} \,\middle|\, \frac{1}{\exp(s_u + b_v + r)}\right)$.


Similar to the PC_b model, $s_u + b_v + r$ is considered as the
apparent ability of the student $u$ to the reviewer $v$, and
this model indicates that more parts of the submission are
likely to be corrected by the reviewer if the apparent ability
of the student is lower. We call this model PC_n. The priors
given in Eqs. (2), (3), and (4) are incorporated into the PC_n
model as well.
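
As a small illustration of the count model, the sketch below computes the Poisson rate $1/\exp(s_u + b_v + r)$ and the corresponding log-probability of an observed number of corrected parts. The function names are hypothetical; only the Python standard library is used.

```python
import math


def pc_n_rate(s_u: float, b_v: float, r: float = 0.0) -> float:
    """Poisson rate 1 / exp(s_u + b_v + r): lower apparent ability, higher rate."""
    return math.exp(-(s_u + b_v + r))


def pc_n_log_pmf(y: int, s_u: float, b_v: float, r: float = 0.0) -> float:
    """Log-probability of observing y corrected parts under the Poisson model."""
    log_rate = -(s_u + b_v + r)
    rate = math.exp(log_rate)
    # log Poisson(y | rate) = y * log(rate) - rate - log(y!)
    return y * log_rate - rate - math.lgamma(y + 1)


if __name__ == "__main__":
    # A student with lower apparent ability is expected to receive more corrections.
    print(pc_n_rate(-1.0, 0.0), pc_n_rate(1.0, 0.0))
```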


Figure 2: Peer correction and peer assessment models, and combined models


5. PEER ASSESSMENT MODELS 


We next review the existing peer assessment models, which
are combined with our peer correction models in the next
section. In particular, we summarize the PG_1 [7], PG_3 [7],
PG_4 [6], and PG_5 [6] models². These are generative
models for grades, $z_{uv} \in \mathbb{R}$. The peer assessment
models are illustrated in Figures 2(b) and 2(c).


The student ability and the reviewer bias parameters are 
also incorporated in the peer assessment models. All the 
models use the same priors given in Eqs. (2) and (3). 


5.1 PG_1 and PG_3 models

In addition to the latent parameters incorporated in the peer
correction models ($s_u$ and $b_v$), the peer assessment models
contain the reviewer reliability $\tau_v \in \mathbb{R}^+$. This
parameter controls how much noise a grade given by the
reviewer contains; it acts as the precision of the grade
distribution. PG_1 is defined as follows:


(Reviewer reliability)  $\tau_v \sim \mathrm{Gamma}(\tau_v \mid \alpha_0, \beta_0)$
(Outcome)               $z_{uv} \sim \mathcal{N}(z_{uv} \mid s_u + b_v, 1/\tau_v)$,


where $\alpha_0$ and $\beta_0$ are hyperparameters. PG_3 is an
extension of PG_1, which incorporates the relationship between
the reviewer reliability and the ability of the reviewer (as a
student). PG_3 is given as follows:


(Reviewer reliability)  $\tau_v = \theta_1 s_v + \theta_0$
(Outcome)               $z_{uv} \sim \mathcal{N}(z_{uv} \mid s_u + b_v, 1/\tau_v)$,


where $\theta_0$ and $\theta_1$ are hyperparameters.
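
To make the role of the reliability concrete, the following minimal sketch evaluates the grade log-likelihood shared by PG_1 and PG_3, and the deterministic reliability of PG_3. The function names are hypothetical, and the positivity check on the PG_3 reliability is an added safeguard rather than part of the model definition.

```python
import math


def normal_log_pdf(z: float, mean: float, precision: float) -> float:
    """Log-density of a normal distribution parameterized by its precision."""
    return (0.5 * math.log(precision)
            - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * precision * (z - mean) ** 2)


def pg_grade_log_lik(z_uv: float, s_u: float, b_v: float, tau_v: float) -> float:
    """log N(z_uv | s_u + b_v, 1 / tau_v), the outcome model of PG_1 and PG_3."""
    return normal_log_pdf(z_uv, s_u + b_v, tau_v)


def pg3_reliability(s_v: float, theta0: float, theta1: float) -> float:
    """PG_3 reliability: a linear function of the reviewer's own ability."""
    tau_v = theta1 * s_v + theta0
    if tau_v <= 0:
        # Safeguard (assumption): a precision must be positive.
        raise ValueError("reliability must be positive")
    return tau_v


if __name__ == "__main__":
    tau = pg3_reliability(s_v=2.0, theta0=1.0, theta1=0.5)
    print(tau, pg_grade_log_lik(3.0, 2.5, 0.5, tau))
```

A grade that matches $s_u + b_v$ is more likely under a reliable reviewer (large $\tau_v$), whereas a grade far from $s_u + b_v$ is better explained by a small $\tau_v$.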


5.2 PG_4 and PG_5 models
PG_4 and PG_5 are variations of PG_3, and they incorporate
the relationship between the reviewer reliability and the
reviewer ability into the priors of the reliability parameter.
The generative models of the reviewer reliability and outcome
in PG_4 are given as follows:


(Reviewer reliability)  $\tau_v \sim \mathrm{Gamma}(\tau_v \mid s_v, \beta_0)$
(Outcome)               $z_{uv} \sim \mathcal{N}(z_{uv} \mid s_u + b_v, 1/\tau_v)$,


and those in PG_5 are given as follows:


(Reviewer reliability)  $\tau_v \sim \mathcal{N}(\tau_v \mid s_v, 1/\beta_0)$
(Outcome)               $z_{uv} \sim \mathcal{N}(z_{uv} \mid s_u + b_v, \lambda/\tau_v)$,


where $\beta_0$ and $\lambda$ are hyperparameters.


² We do not include PG_2 [7], which is almost the same as PG_1
except that it incorporates time-series factors.


6. COMBINED MODELS FOR PEER CORRECTION AND PEER ASSESSMENT


We finally combine our peer correction models and the exist- 
ing peer assessment models. By combining these two types 
of models, we expect to capture an inconsistency between 
the outcome of peer correction and that of peer assessment; 
the inconsistency can be informative for estimating the re- 
viewer reliabilities. 


We use PG_1 and PC_b to explain how the models are combined,
and we call the combined model PG_1+PC_b. We simply assume
that $s_u$ and $b_v$ are shared between these two models;
namely, the generative model for PG_1+PC_b is given as:


(Student ability)       $s_u \sim \mathcal{N}(s_u \mid \mu_0, 1/\gamma_0)$
(Reviewer reliability)  $\tau_v \sim \mathrm{Gamma}(\tau_v \mid \alpha_0, \beta_0)$
(Reviewer bias)         $b_v \sim \mathcal{N}(b_v \mid 0, 1/\eta_0)$
(Noise)                 $r \sim \mathcal{N}(r \mid 0, 1/\kappa_0)$
(Outcomes)              $z_{uv} \sim \mathcal{N}(z_{uv} \mid s_u + b_v, 1/\tau_v)$, and
                        $y^{(b)}_{uv} \sim \mathrm{Bern}\left(y^{(b)}_{uv} \mid \sigma(s_u + b_v + r)\right)$.


The PG_1+PC_b model is illustrated in Figure 2(d). The other
combined models are defined in the same way as PG_1+PC_b.


When an inconsistency occurs between corrections and
grades, e.g., a reviewer provides a high grade to a submission
but makes many corrections in it, we consider that a large
noise occurs on the grade ($z_{uv}$) and thus the reliability
of the reviewer ($\tau_v$) is estimated as low. The combination
of peer assessment models and peer correction models allows
us to leverage such inconsistencies to estimate the reviewer
reliabilities and the student abilities.
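
The effect of an inconsistency can be illustrated numerically. The sketch below evaluates the joint log-likelihood of one review under the combination of the PG_1 grade model and the binary correction model, with the student ability and reviewer bias shared between the two terms. The function names and the numerical values are hypothetical and serve only as an illustration; this is not the inference procedure itself.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log of the logistic function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def joint_log_lik(z_uv, y_b_uv, s_u, b_v, r, tau_v):
    """Joint log-likelihood of a grade and a binary correction outcome."""
    mean = s_u + b_v  # shared between the grade and correction terms
    grade_term = (0.5 * math.log(tau_v)
                  - 0.5 * math.log(2.0 * math.pi)
                  - 0.5 * tau_v * (z_uv - mean) ** 2)
    logit = mean + r
    # y = 1: not corrected (prob. sigma(logit)); y = 0: corrected (prob. sigma(-logit)).
    correction_term = log_sigmoid(logit) if y_b_uv == 1 else log_sigmoid(-logit)
    return grade_term + correction_term


if __name__ == "__main__":
    # Inconsistent review: high grade (4) although the submission was corrected (y = 0)
    # and the shared apparent ability s_u + b_v is only 1.
    low_tau = joint_log_lik(4.0, 0, 1.0, 0.0, 0.0, tau_v=0.1)
    high_tau = joint_log_lik(4.0, 0, 1.0, 0.0, 0.0, tau_v=10.0)
    print(low_tau > high_tau)
```

When the correction outcome pulls the shared ability estimate away from the grade, the grade is better explained by a small $\tau_v$, which is how the combined model attributes the inconsistency to an unreliable reviewer.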


7. EXPERIMENTS 
We conduct experiments using the actual peer assessment 
and peer correction dataset about language translation. We 
investigate the effectiveness of the proposed methods to es- 
timate the student abilities. 


7.1 Baselines 

We compare the proposed models (PC_b, PC_n, and
PG_{1,3,4,5}+PC_{b,n}) with the following baselines:

(a) Correction ratio: this is a naive version of PC_b
and considers the correction ratio of each student as the
ability. Specifically, the correction ratio is defined as
$\sum_{y^{(b)}_{uv} \in \mathcal{Y}^{(b)}_u} \delta(y^{(b)}_{uv} = 0) / |\mathcal{Y}^{(b)}_u|$,
where $\mathcal{Y}^{(b)}_u$ is the set of correction outcomes
for the student $u$, and $\delta(\cdot)$ is the indicator
function. To assign a higher ability to a student with fewer
corrections, we multiply the value by $-1$.

(b) Mean number of corrected parts: this is a naive version
of PC_n and considers the mean number of the corrected parts
of each student as the ability. Specifically, the mean number
of the corrected parts of the student $u$ is defined as
$\sum_{y^{(n)}_{uv} \in \mathcal{Y}^{(n)}_u} y^{(n)}_{uv} / |\mathcal{Y}^{(n)}_u|$,
where $\mathcal{Y}^{(n)}_u$ is the set of correction outcomes
for the student $u$. To assign a higher ability to a student
with fewer corrected parts, we multiply the value by $-1$.

(c) Mean grades: this is a naive version of PG_1 and considers
the mean assigned grade of each student as the ability. The
mean grade is defined as
$\sum_{z_{uv} \in \mathcal{Z}_u} z_{uv} / |\mathcal{Z}_u|$,
where $\mathcal{Z}_u$ is the set of grades assigned to
the student $u$.

(d) PG_1, PG_3, PG_4, and PG_5: existing peer assessment
models.


Table 2: Average and standard deviation of AUC scores of each method on various classification boundaries. Each column
indicates the results for each classification boundary; for example, (D, C, B, A | A+) represents the results for classifying the
students at A+ and the others. The winner for each boundary is bold-faced. The cases where a combined model outperforms
the corresponding peer assessment model (PG_1, PG_3, PG_4, or PG_5) are underlined.

[Table 2 body is not recoverable from the extracted text.]
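
The three naive baselines can be computed directly from the review tuples, as in the following minimal sketch. The input layout (one tuple per review holding the student, the reviewer, the grade, the binary correction outcome, and the number of corrected parts) and the function name `naive_baselines` are assumptions made for illustration.

```python
from collections import defaultdict


def naive_baselines(reviews):
    """Compute the naive ability estimates of each student.

    `reviews` is an iterable of (u, v, z_uv, y_b_uv, y_n_uv) tuples, where
    y_b_uv = 0 means the submission was corrected and y_n_uv is the number
    of corrected parts (assumed layout).
    """
    grades = defaultdict(list)
    binary = defaultdict(list)
    counts = defaultdict(list)
    for u, _v, z_uv, y_b_uv, y_n_uv in reviews:
        grades[u].append(z_uv)
        binary[u].append(y_b_uv)
        counts[u].append(y_n_uv)

    estimates = {}
    for u in grades:
        correction_ratio = sum(1 for y in binary[u] if y == 0) / len(binary[u])
        mean_parts = sum(counts[u]) / len(counts[u])
        estimates[u] = {
            # Multiplied by -1 so that fewer corrections mean a higher ability.
            "correction_ratio": -correction_ratio,
            "mean_corrected_parts": -mean_parts,
            "mean_grade": sum(grades[u]) / len(grades[u]),
        }
    return estimates


if __name__ == "__main__":
    toy = [("u1", "v1", 4, 1, 0), ("u1", "v2", 3, 0, 2), ("u2", "v1", 1, 0, 5)]
    print(naive_baselines(toy))
```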


7.2 Experimental setup 

We implemented the models using the No-U-Turn Sampler
(NUTS) [3], which is a variant of Hamiltonian Monte Carlo.
We ran four chains, which produced 5,000 samples in total.
The initial 500 samples were discarded, and the average of
the remaining samples was used as the estimated parameters.


We randomly generated 150 sets of candidate hyperparameters
for each method. A method with a set of candidate
hyperparameters produces the estimated student abilities.
Their performance was evaluated using the ground truth of
20% of the students. We then selected the best set of
hyperparameters for the method, and the final result for each
method was evaluated on the remaining students. We performed
this procedure five times and calculated the average.


Each method outputs the estimated ability of each student.
We use the expertise levels assessed by the experts as the
ground truth, and investigate how accurately each method
classifies the students with high expertise and those with
low expertise. We specifically use the area under the ROC
curve (AUC) as an evaluation metric, computed for each of the
four classification boundaries: (D, C, B, A | A+),
(D, C, B | A, A+), (D, C | B, A, A+), and (D | C, B, A, A+).
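
The evaluation for one classification boundary can be sketched as follows. The AUC is computed here through its pairwise-ranking interpretation (the fraction of positive-negative student pairs that are ordered correctly, with ties counted as one half); the function name and the argument layout are hypothetical.

```python
LEVELS = ["D", "C", "B", "A", "A+"]  # expertise levels from lowest to highest


def auc_for_boundary(abilities, levels, lowest_positive_level):
    """AUC for separating students at or above a level from the others.

    abilities: estimated ability of each student
    levels: ground-truth expertise level of each student
    lowest_positive_level: e.g. "A+" for the boundary (D, C, B, A | A+)
    """
    threshold = LEVELS.index(lowest_positive_level)
    positives = [a for a, lv in zip(abilities, levels) if LEVELS.index(lv) >= threshold]
    negatives = [a for a, lv in zip(abilities, levels) if LEVELS.index(lv) < threshold]
    if not positives or not negatives:
        raise ValueError("both classes must be non-empty")
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    est = [0.1, 0.2, 0.3, 0.4, 0.9]
    truth = ["D", "C", "B", "A", "A+"]
    print(auc_for_boundary(est, truth, "A+"))
```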


7.3 Results 

Table 2 shows the AUC scores of each method on different
classification boundaries. Our peer correction models (PC_b
and PC_n) demonstrate better or comparable performance
to the existing peer grading models in detecting the students
at the highest level; this supports the effectiveness of the
peer correction results for estimating student abilities. We
see that the "no-correction" cases only occur for high-ability
students, and the correction information is helpful for
distinguishing between the "perfect students" and "almost
perfect students": both are likely to obtain the highest
grades from the reviewers, so the correction outcomes are
required to classify them.


In contrast, the performance of the peer correction models
becomes inferior for detecting the students at lower levels,
and the mean-grade baseline achieves the best performance for
detecting the students at the lowest level; the average of the
obtained grades is sufficiently informative for detecting
low-ability students. Our methods would be beneficial in
situations where teachers aim to detect students who require
advanced course materials or assignments.


The combined models PG_{1,3,4,5}+PC_b outperform the
corresponding PG_{1,3,4,5} in most cases; the outcomes of
peer correction are useful for improving the student ability
estimation. It is noteworthy that PG_5+PC_b achieves an
AUC of 0.914 for classifying the students at A+ and the
others. This result stems from the capability of the combined
models to capture the inconsistencies between the
outcomes of assessments and those of corrections.


The number of corrected parts can be more informative
than simply considering whether a submission is corrected;
in fact, PC_n is better than PC_b, and the mean number of
corrected parts baseline performs better than the correction
ratio baseline; however, PG_{1,3,4,5}+PC_b outperforms
PG_{1,3,4,5}+PC_n in our experiments. Because there are
more model variations in PC_n than in PC_b, a more meticulous
modeling for combining the PG models and the PC_n
model would be required.


8. RELATED WORK 


Peer assessment models are categorized into two groups: 
models for cardinal peer assessment and models for ordinal 
peer assessment. The former models target a situation where 
the outcomes are assigned in explicit numerical scores, such 
as five-point scores. In addition to the probabilistic models 
reviewed in Section 5, Walsh proposed PeerRank [13], an 
extension of PageRank for peer assessment. In ordinal peer 
assessment, each grader is shown multiple submissions and 
instructed to rank them. The Bradley-Terry model [2] has
been applied for ordinal peer assessment [11, 8] and Mi et 
al. proposed to use the cardinal peer assessment models for 
ordinal peer assessment [6]. Although several probabilistic 
models for peer assessment have been studied, peer correc- 
tion has not yet been investigated. 


The design of peer assessment frameworks has also been
studied to improve the reliability of evaluation. Kulkarni
et al. [4] reported that giving graders feedback about their
grading bias was beneficial for improving the reliability.
Another work proposed to design peer assessment as a
multiple-choice task where a grader is instructed to choose
the best submission [5]. Peer assessment mechanisms based on
game theory have been introduced to derive accurate
evaluations from peers [14].


Our peer correction models are closely related to the models
studied in item response theory, which quantify student
abilities and item characteristics in educational tests. One
of the simplest item response theory models is the
Rasch model [9], and our PC_b model (given in Eq. (1)) has a
similar formulation to the Rasch model.
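
The correspondence can be written out explicitly. In the block below, the Rasch model is stated in its usual form, where $\theta_i$ (ability of examinee $i$), $\beta_j$ (difficulty of item $j$), and $x_{ij}$ (correct-response indicator) are standard item response theory notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
% Rasch model: probability that examinee i answers item j correctly
P(x_{ij} = 1 \mid \theta_i, \beta_j) = \sigma(\theta_i - \beta_j)

% Binary peer correction model, Eq. (1): probability that the submission
% of student u is NOT corrected by reviewer v
P\left(y^{(b)}_{uv} = 1 \mid s_u, b_v, r\right) = \sigma(s_u + b_v + r)
```

Under this reading, the student ability $s_u$ plays the role of $\theta_i$ and the negated reviewer bias $-b_v$ plays the role of the item difficulty $\beta_j$, so a reviewer with a low bias behaves like a difficult item; the noise term $r$ has no counterpart in the Rasch model.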


Besides peer assessment, probabilistic models for estimat- 
ing grader reliability have been studied in crowdsourcing 
as well. Specifically, a two-stage framework was proposed 
where crowdsourcing workers in the first stage produce out- 
puts, such as translations or logo designs, and another set of 
workers in the second stage evaluates the outputs [1]. Prob- 
abilistic models for estimating the reliability of each grader 
and the quality of each output in this two-stage framework 
have been proposed [1, 12]. Unlike peer assessment, the over- 
lap between students (i.e., creators of outputs) and graders 
is not assumed in crowdsourcing. 


9. CONCLUSIONS 


We presented probabilistic models for peer correction, which 
are used for estimating the student abilities. We proposed 
two models: one considering whether a grader has corrected 
a submission, and the other utilizing the number of corrected 
parts in each submission. We also combined the peer cor- 
rection models with the peer assessment models; this com- 
bination allows us to estimate the reliability of graders from 
the outcomes of peer corrections and those of peer
assessment by considering the consistency between the
corrections and assessments. The experiments using the actual
dataset of peer correction showed that the combination of
peer correction models and peer assessment models was
particularly effective in detecting high-ability students.


In our models, we did not consider the importance of each
corrected part; however, the importance differs among
corrected parts, which range from minor corrections (e.g.,
adding a punctuation mark) to major corrections (e.g.,
paraphrasing). A major correction would indicate the low
quality of a submission, and considering such factors is a
promising direction to improve the ability estimation accuracy.